The Moneyball project inspired us to look at hidden factors that can be used to improve the performance of a sports player or team. The 2016 figure skating world championships was recently held in Boston at TD garden, which sparked our interest in this particular sport. Audience members would hold their breath at each jump or spin performed by the skaters, and the final results of each skater seem to be very unpredictable until the end of their performance. That led us to wonder, can the results of a sport as artistic as figure skating be predicted using past data? The International Skating Union is an international authoritative organization with public records on previous championship results as well as information on every individual who participated in any one of the worldwide events for figure skating. The ICU website gives us access to reliable data about skaters’ demographics, scores, rankings in the past and so on since 2008. We would like to apply what we have learned in data science to the field of figure skating in order to explore data from these professional competitions and predict the performance of skaters in competitions.
We used the ISU websites to download our datasets.
There are 3 types of data we used:
Example: http://www.isuresults.com/isujsstat/sb2015-16/sbtsmto.htm
Example: http://www.isuresults.com/results/fc2009/SEG001.HTM
Example: http://www.isuresults.com/bios/isufs00007684.htm
For figure skating, we have 4 categories: m - men l - ladies p - pair d - ice dance
Most of our analysis will focus on single events data for men and ladies. We decided to exclude pairs data because after loading the initial data, we realized that the sample size for pairs data is too small. Also, we would have needed to consider a lot more variables for the analysis of pairs data. Ex. if female skater is shorter than her male partner. However, we did still include doubles data in our descriptive analysis of the datasets. We excluded ice dance because the scoring rule is very different from other categories. Also there was a change in the judging system a few years ago, so we did not have enough historical data for this category.
After scrapping the datasets from the ISU website, we wrangled the raw data to create clean and useful datasets for our analysis. After cleaning the bio data, we kept only the variables that are useful for our analysis, including country, date of birth, height, and start_career. We also added new columns for continent, number of skaters in the country they’re from, and quintiles ranking countries by number of skaters. All other bio data were qualitative and would have been too difficult to run analysis on. After cleaning the event results data, we turned the dataset from a long format where short program and free skate results were in separate rows, to a wide format where the results are in the same row. We also added columns for total score. Then we combined the biography data with the event results data and season best score data. Finally we added a new variable to our datasets indicating the ranking for each skater by quintiles for the entire year in the season best score data, and for each competition for event results data. After the data was loaded and wrangled, we saved all the datasets that are to be used as Final datasets.RData. The code to scrape and wrangle the data is included in the rscript called FINAL data loading and wrangling R code.R. In this document, we will start by uploading the cleaned datasets.
library(dplyr)
library(tidyr)
library(ggplot2)
library(lubridate)
library(readr)
library(countrycode)
library(knitr)
library(stats)
library(broom)
Loading already scraped and wrangled datasets
load("/Users/NathalieGuo/Downloads/FINAL datasets.RData")
For our exploratory analysis, we looked at different variables from our biography dataset and how they relate to Season Best Scores, and individual event scores.
Our data consists of information on skaters from 72 countries in 5 continents.
return(c(length(unique(bio$country)),length(unique(bio$continent))))
## [1] 72 5
We first want to know the distribution of the number of skaters in each country across the years. Based on our graphs, we can see that USA, Canada, and Russia are the top three countries where the greatest number of skaters are from. This makes sense as ice skating is one of the more popular sports in those countries. We ranked countries into quintiles based on number of skaters from that country, and used the country rank variable later in our regression analysis as a predictor for skater scores.
bio %>% filter(category=="m"&num_quintile%in%c("80~100%","60~80%")) %>%
ggplot(aes(x=reorder(country,-numofskater),y=numofskater,fill=continent))+
geom_bar(stat="identity")+xlab("Country")+ylab("number of male single skaters") +
ggtitle('Number of Male Skaters by Country')
bio %>% filter(category=="l"&num_quintile%in%c("80~100%","60~80%")) %>%
ggplot(aes(x=reorder(country,-numofskater),y=numofskater,fill=continent))+
geom_bar(stat="identity")+xlab("Country")+ylab("number of female single skaters") +
ggtitle('Number of Female Skaters by Country')
We looked at which countries had the highest number of skaters, as well as which countries had the highest Season Best Scores. If we look at the mean scores throughout the years, we see that Russia has the highest average Season Best Score for ladies. Uzbekistan has highest average Season Best Score in men. China has the highest average Season Best Score for pairs. USA has the highest number of skaters participating in both single and doubles events.
data <- SeasonBestScoreSingle.df %>%
group_by(program_type,category, country) %>%
summarise(count = n(), score_mean = mean(score), score_sd = sd(score)) %>%
mutate(group = paste(category,program_type))
#country with maximum scores
maxScoreSingle.df <- do.call(rbind,lapply(split(data,data$group),function(chunk) chunk[which.max(chunk$score_mean),]))
maxScoreSingle.df
## Source: local data frame [6 x 7]
## Groups: program_type, category [1]
##
## program_type category country count score_mean score_sd group
## (chr) (chr) (chr) (int) (dbl) (dbl) (chr)
## 1 fs l RUS 74 111.08568 18.740378 l fs
## 2 sp l RUS 74 59.69649 8.580273 l sp
## 3 to l RUS 104 161.41683 28.555741 l to
## 4 fs m UZB 5 142.53600 10.688486 m fs
## 5 sp m UZB 5 72.66000 6.997607 m sp
## 6 to m UZB 6 208.48833 19.849003 m to
#country with maximum count
maxCountSingle.df <- do.call(rbind,lapply(split(data,data$group),function(chunk) chunk[which.max(chunk$count),]))
maxCountSingle.df
## Source: local data frame [6 x 7]
## Groups: program_type, category [1]
##
## program_type category country count score_mean score_sd group
## (chr) (chr) (chr) (int) (dbl) (dbl) (chr)
## 1 fs l USA 97 102.59732 16.742392 l fs
## 2 sp l USA 97 55.35495 7.722283 l sp
## 3 to l USA 156 152.02635 23.322165 l to
## 4 fs m USA 89 132.11112 25.292853 m fs
## 5 sp m USA 90 67.44122 12.485054 m sp
## 6 to m USA 140 196.78471 34.521255 m to
data <- SeasonBestScoreDouble.df %>%
group_by(program_type,category, country) %>%
summarise(count = n(), score_mean = mean(score), score_sd = sd(score)) %>%
mutate(group = paste(category,program_type))
#country with maximum scores
maxScoreDouble.df <- do.call(rbind,lapply(split(data,data$group),function(chunk) chunk[which.max(chunk$score_mean),]))
maxScoreDouble.df
## Source: local data frame [6 x 7]
## Groups: program_type, category [1]
##
## program_type category country count score_mean score_sd group
## (chr) (chr) (chr) (int) (dbl) (dbl) (chr)
## 1 fd d AZE 3 86.46667 3.639579 d fd
## 2 sd d RUS 70 57.30243 7.674735 d sd
## 3 to d AZE 4 146.25750 10.535386 d to
## 4 fs p CHN 32 113.06281 21.157853 p fs
## 5 sp p CHN 32 60.17625 11.364523 p sp
## 6 to p CHN 50 168.59660 33.640941 p to
#country with maximum count
maxCountDouble.df <- do.call(rbind,lapply(split(data,data$group),function(chunk) chunk[which.max(chunk$count),]))
maxCountDouble.df
## Source: local data frame [6 x 7]
## Groups: program_type, category [1]
##
## program_type category country count score_mean score_sd group
## (chr) (chr) (chr) (int) (dbl) (dbl) (chr)
## 1 fd d USA 73 82.14068 15.31989 d fd
## 2 sd d USA 73 55.30671 10.48632 d sd
## 3 to d USA 117 141.22974 27.70363 d to
## 4 fs p RUS 68 108.53382 21.01386 p fs
## 5 sp p RUS 68 57.58147 10.89465 p sp
## 6 to p USA 108 141.72407 25.69334 p to
We compared Season Best Scores for the countries with the highest number of skaters (ie. Japan, USA, Russia, and Canada), to all other countries. The graphs show that there is a significant difference in Season Best Scores between these countries. Suggesting that country of origin may be a significant predictor for a skater’s scores.
plot.df <-
SeasonBestScoreSingle.df %>%
filter(program_type =="to") %>%
mutate(display_country=ifelse(country %in% c("RUS","USA", "CAN","JPN") , country, "other countries"))%>%
group_by(category, display_country,season_start_year) %>%
summarize(score = mean(score), count = n()
)
ggplot(plot.df ,aes(x=season_start_year,y=score)) +
geom_point(aes(color=display_country,group=display_country, size = count))+
facet_wrap(~category, scales = "free") +
xlab("season start year") +ylab("mean score by country") +
ggtitle("SBS for Singles by Country")
plot.df <-
SeasonBestScoreDouble.df %>%
filter(program_type =="to") %>%
mutate(display_country=ifelse(country %in% c("RUS","USA", "CAN","JPN") , country, "other countries"))%>%
group_by(category, display_country,season_start_year) %>%
summarize(score = mean(score), count = n()
)
ggplot(plot.df ,aes(x=season_start_year,y=score)) +
geom_point(aes(color=display_country,group=display_country, size = count))+
facet_wrap(~category, scales = "free") +
xlab("season start year") +ylab("mean score by country") +
ggtitle("SBS for Doubles by Country")
Based on these results, we determined that country of origin could be a potential variable for predicting a skater’s performance.
Next we were interested in including height as a potential variable for prediction, so looked at the distribution of height by gender. When we used qqnorm and qqline to check the normality of the distributions, we found that the height of female skaters is normally distributed, however, the height of male skaters is a little left-skewed.
bio %>% filter(is.na(height)==FALSE) %>% ggplot(aes(height)) +
geom_histogram()+facet_grid(.~category)+stat_bin(bins=30) +
ggtitle('Heights by Gender')
bio %>% filter(is.na(height)==FALSE) %>%
ggplot(aes(continent,height,color=continent)) +
geom_boxplot()+ facet_grid(.~category) +
ggtitle('Heights by Gender')
Next, we looked at the distribution of Season Best Scores and Event Scores by categories: Men Singles, Ladies Singles, Pairs, Ice Dancing.
#let see if the year distribution is the same
plot.df <- SeasonBestScoreSingle.df %>%
filter(program_type == "to") %>%
mutate(group = paste(category,program_type))
ggplot(filter(plot.df,category == "m") ,aes(score)) +
geom_histogram()+
facet_wrap(~season_start_year, scales = "fixed") +
ggtitle("Season Best Score Distribution for every year for men total score")
qqnorm(filter(plot.df,category == "m"&season_start_year==2015)$score)
qqline(filter(plot.df,category == "m"&season_start_year==2015)$score)
ggplot(filter(plot.df,category == "l") ,aes(score)) +
geom_histogram()+
facet_wrap(~season_start_year, scales = "fixed") +
ggtitle("Season Best Score Distribution for every year for ladies total score")
qqnorm(filter(plot.df,category == "l"&season_start_year==2015)$score)
qqline(filter(plot.df,category == "l"&season_start_year==2015)$score)
plot.df <- SeasonBestScoreDouble.df %>%
filter(program_type == "to") %>%
mutate(group = paste(category,program_type))
ggplot(filter(plot.df,category == "p") ,aes(score)) +
geom_histogram()+
facet_wrap(~season_start_year, scales = "fixed") +
ggtitle("Season Best Score Distribution for every year for pair program total score")
qqnorm(filter(plot.df,category == "p"&season_start_year==2015)$score)
qqline(filter(plot.df,category == "p"&season_start_year==2015)$score)
ggplot(filter(plot.df,category == "d") ,aes(score)) +
geom_histogram()+
facet_wrap(~season_start_year, scales = "fixed") +
ggtitle("Season Best Score Distribution for every year for ice dancing total score")
qqnorm(filter(plot.df,category == "d"&season_start_year==2015)$score)
qqline(filter(plot.df,category == "d"&season_start_year==2015)$score)
For men, we can see that the distribution of Season Best Score is fairly normal across the years. With the average increasing over the years.
For ladies, the SBS distribution in earlier years are more right skewed, but have become more normally distribution since 2012. However from the QQ plot, we can see that tails still do not follow a normal distribution as well the men’s distribution do.
For both pairs and ice dancing, the SBS distribution do not follow normality, as seen on the QQ plot. There are many “gaps” in the distribution. However, the lack of normal distribution could be due to the fact that pairs and ice dance have fewer skaters than men singles and ladies singles.
Looking at average individual event scores over the years by gender, we can see that men consistently score higher than women.
single_gender<-EventResultSingle_wr%>%
group_by(program_type,category,season_start_year)%>%
summarize(TSS=mean(TSS,na.rm=TRUE),TES=mean(TES,na.rm=TRUE),PCS=mean(PCS,na.rm=TRUE),
SS=mean(SS,na.rm=TRUE),TR=mean(TR,na.rm=TRUE),PE=mean(PE,na.rm=TRUE),
CH=mean(CH,na.rm=TRUE),IN=mean(IN,na.rm=TRUE))
single_gender%>%
ggplot(aes(x=season_start_year,y=TSS,group=category,color=category))+
geom_line()+geom_point()+ggtitle('TSS Scores by Gender')+facet_wrap(~program_type)+
scale_x_continuous(breaks=c(seq(2005,2015, by=2)))
These results suggest that we should run separate predictive models for Men and Women.
To understand the score breakdown, each competition includes scores for two events, Short Program and Free Skate. For each event, there’s a total score (TSS) which is a sum of the total elements score (TES), and the program component score (PCS). The program component score includes scores for 5 individual component: skating skills (SS), Transitions (TR), Performance/Execution (PE), Choreography / Composition (CH), and Interpretation (IN). Each component is marked with a value from 0 to 10 in 0.25 increments. These five marks are then multiplied by a factor depending on the type of program and level. For senior ladies and pairs, the factor is 0.8 for the short program and 1.6 for the long program. This means that PCS is more important in the long program. *Note:Since we will not focus on ice dance, we will not look at free dance (fd) and short dance (sd).
Source: http://gofigureskating.com/compete/scoring.html
Please note that we only have SBS for “Short Program” and “Free Skating” since 2011.
ggplot(data= filter(SeasonBestScoreSingle.df, season_start_year >= 2011) ,aes(x=category,y=score))+
geom_boxplot(aes(color=category)) +ggtitle("SBS for Singles") +xlab("score type") +ylab("SBS") +
facet_wrap(~program_type, scales = "fixed")
ggplot(data= filter(SeasonBestScoreDouble.df, season_start_year >= 2011) ,aes(x=program_type,y=score))+
geom_boxplot(aes(color=program_type)) +ggtitle("SBS for Doubles") +xlab("score type") +ylab("SBS") +
facet_wrap(~category, scales = "fixed")
Overall, we saw that free skating has higher SBS than short program for each category.
#create a combination table
todata <-
SeasonBestScoreSingle.df%>%
filter(program_type == "to") %>%
mutate(event_to = event, date_to= date, score_to = score) %>%
select (first_name, last_name, country, season_start_year, category, event_to, date_to, score_to)
fsdata <-
SeasonBestScoreSingle.df%>%
filter(program_type == "fs") %>%
mutate(event_fs = event, date_fs= date, score_fs = score) %>%
select (first_name, last_name, country, season_start_year,category, event_fs, date_fs, score_fs)
spdata <-
SeasonBestScoreSingle.df%>%
filter(program_type == "sp") %>%
mutate(event_sp = event, date_sp= date, score_sp = score) %>%
select (first_name, last_name, country, season_start_year,category, event_sp, date_sp, score_sp)
alldata <- full_join(todata, fsdata)
alldata <- full_join(alldata,spdata)
data<- alldata %>%
filter(season_start_year >= 2011) %>%
filter(!is.na(score_fs)&!is.na(score_sp)) %>%
mutate(sameEventForFSandSP = ifelse(event_fs==event_sp, TRUE, FALSE)) %>%
mutate (score_diff = score_fs - score_sp)
plot.df <-
data %>%
group_by(season_start_year, category, sameEventForFSandSP) %>%
summarise(count= n(), score_to_mean = mean(score_to) )
ggplot(plot.df ,aes(x=season_start_year,y=score_to_mean)) +
geom_line(aes(color=sameEventForFSandSP,group=sameEventForFSandSP))+
facet_wrap(~category, scales = "free") +
ggtitle("Skaters who have SBS from same event vs. not") +xlab("season start year") +ylab("mean SBS total score")
We can see that in general skaters who have their Season Best Scores for free skating and short program from different events usually have higher total Season Best Scores.
We looked at Program Component Scores, Total Elements Scores, and Total Scores over the years for each of the four competitions in the data set. Based on the graphs, we can see that GPF consistently has the highest scores, followed by World Championships, European Championships, and Four Continents.
#Scores over the year by event
singles_scores<-EventResultSingle_wr%>%
group_by(program_type,event_code,season_start_year)%>%
summarize(TSS=mean(TSS,na.rm=TRUE),TES=mean(TES,na.rm=TRUE),PCS=mean(PCS,na.rm=TRUE),
SS=mean(SS,na.rm=TRUE),TR=mean(TR,na.rm=TRUE),PE=mean(PE,na.rm=TRUE),
CH=mean(CH,na.rm=TRUE),IN=mean(IN,na.rm=TRUE))
singles_scores%>%
ggplot(aes(x=season_start_year,y=TSS,group=event_code,color=event_code))+
geom_line()+geom_point()+ggtitle('TSS Scores')+facet_wrap(~program_type) +
scale_x_continuous(breaks=c(seq(2005,2015, by=2)))
singles_scores%>%
ggplot(aes(x=season_start_year,y=TES,group=event_code,color=event_code))+
geom_line()+geom_point()+ggtitle('TES Scores')+facet_wrap(~program_type)+
scale_x_continuous(breaks=c(seq(2005,2015, by=2)))
singles_scores%>%
ggplot(aes(x=season_start_year,y=PCS,group=event_code,color=event_code))+
geom_line()+geom_point()+ggtitle('PCS Scores')+facet_wrap(~program_type)+
scale_x_continuous(breaks=c(seq(2005,2015, by=2)))
Below is a summary of the average number events per skater, as well as number of participants for each event by gender. We can see that the Grand Prix Final has a lot fewer participants than the other competitions. This is because the Grand Prix Final is a cumulation of the ISU Grand Prix of Figure Skating series, which consists of the Skate America, Skate Canada International, Trophee Eric Bompard, Cup of China, Cup of Russia, and NHK Trophy competitions. Only the top 6 skaters from each discipline go on to compete in the final.
length(unique(EventResultSingle_wr$full_name))
## [1] 686
#686 total number of single's event attendents
#looking at events per skater
Unique_events<-EventResultSingle_wr%>%group_by(full_name)%>%
summarize(events=length(event_year))
summary(Unique_events)
## full_name events
## Length:686 Min. : 1.00
## Class :character 1st Qu.: 2.00
## Mode :character Median : 2.00
## Mean : 5.02
## 3rd Qu.: 6.00
## Max. :34.00
hist(Unique_events$events, breaks=25, main='Average Events per Skater', xlab='Unique Events',
ylab='Number of Skaters')
#Summary of Number of skaters for each event
EventResultSingle_wr%>%group_by(season_start_year,event_code,category)%>%
summarize(count=n())%>%
ggplot(aes(x=season_start_year,y=count,fill=category))+
geom_bar(stat='identity',position=position_dodge())+facet_wrap(~event_code)
EventResultSingle_wr%>%group_by(event_code,category)%>%summarize(n())
## Source: local data frame [8 x 3]
## Groups: event_code [?]
##
## event_code category n()
## (chr) (chr) (int)
## 1 ec Ladies 553
## 2 ec Men 577
## 3 fc Ladies 547
## 4 fc Men 517
## 5 gpf Ladies 95
## 6 gpf Men 91
## 7 wc Ladies 478
## 8 wc Men 586
We ran a correlation matrix of the component scores Free skating against component scores for Short Program. The purpose of this was to find which individual components were more correlated with the total score, and also which components were more correlated with each other. We wanted to see if there are certain components that correlate with the total score more, and recommend that skaters invest more time in improving those components. However, our results showed a high correlation with between all the components, indicating that all components are equally important for skaters.
x<-single_event%>%select(SS_FS:IN_FS,SS_SP:IN_SP)
x<-na.omit(x)
cor_matrix<-data.frame(cor(x))
cov_matrix<-data.frame(cov(x))
round(cor_matrix, digits=2)
## SS_FS TR_FS PE_FS CH_FS IN_FS SS_SP TR_SP PE_SP CH_SP IN_SP
## SS_FS 1.00 0.99 0.99 0.99 0.99 0.97 0.97 0.97 0.97 0.97
## TR_FS 0.99 1.00 0.99 0.99 0.99 0.96 0.97 0.97 0.97 0.97
## PE_FS 0.99 0.99 1.00 0.99 0.99 0.96 0.96 0.96 0.96 0.96
## CH_FS 0.99 0.99 0.99 1.00 1.00 0.97 0.97 0.97 0.97 0.97
## IN_FS 0.99 0.99 0.99 1.00 1.00 0.96 0.96 0.96 0.96 0.97
## SS_SP 0.97 0.96 0.96 0.97 0.96 1.00 0.99 0.99 0.99 0.99
## TR_SP 0.97 0.97 0.96 0.97 0.96 0.99 1.00 0.99 0.99 0.99
## PE_SP 0.97 0.97 0.96 0.97 0.96 0.99 0.99 1.00 0.99 0.99
## CH_SP 0.97 0.97 0.96 0.97 0.96 0.99 0.99 0.99 1.00 1.00
## IN_SP 0.97 0.97 0.96 0.97 0.97 0.99 0.99 0.99 1.00 1.00
round(cov_matrix, digits=2)
## SS_FS TR_FS PE_FS CH_FS IN_FS SS_SP TR_SP PE_SP CH_SP IN_SP
## SS_FS 1.65 1.68 1.70 1.71 1.75 1.54 1.55 1.58 1.58 1.63
## TR_FS 1.68 1.74 1.75 1.76 1.81 1.57 1.60 1.62 1.61 1.67
## PE_FS 1.70 1.75 1.79 1.78 1.84 1.58 1.60 1.63 1.62 1.68
## CH_FS 1.71 1.76 1.78 1.80 1.85 1.60 1.62 1.64 1.64 1.70
## IN_FS 1.75 1.81 1.84 1.85 1.91 1.64 1.66 1.69 1.69 1.75
## SS_SP 1.54 1.57 1.58 1.60 1.64 1.52 1.53 1.55 1.55 1.60
## TR_SP 1.55 1.60 1.60 1.62 1.66 1.53 1.56 1.57 1.57 1.62
## PE_SP 1.58 1.62 1.63 1.64 1.69 1.55 1.57 1.60 1.59 1.65
## CH_SP 1.58 1.61 1.62 1.64 1.69 1.55 1.57 1.59 1.60 1.65
## IN_SP 1.63 1.67 1.68 1.70 1.75 1.60 1.62 1.65 1.65 1.71
In this section, we want to explore the associations between biography variables and each outcome of interest by using linear regression models stratified by sex. Biography variables include current age at event year, height, career length in years and category of country regarding the quintiles of number of sex-specific skaters. Outcomes are event total score, season best total score, season best score of short program and season best score of free skating.
Before we built models, ggplots were used to visualize the associations. Univariate linear regression models were built for each combination of biography variables and outcomes. Since all the associations between single predictor and outcomes are statistically significant, we included them all in the final multivariate linear regression models.
bsb<- seasonbestsingle_bio
bsb %>% filter(is.na(name)==TRUE) %>% summarise(n())
## Source: local data frame [1 x 1]
##
## n()
## (int)
## 1 260
Out of 7295 observations, there are 260 missing biography info.
##filter out observations with missing biography info
bsb <- bsb %>% filter(is.na(name)==FALSE)
bsb_to<- bsb %>% filter(program_type=="to")
bsb_to %>% ggplot(aes(age,score,color=category))+geom_point()+
geom_smooth(method="lm")+facet_grid(.~category) + ggtitle('Association between Age and Total Score')
bsb_to %>% group_by(category) %>% do(tidy(lm(score~ age,data= .),data= .))
## Source: local data frame [4 x 6]
## Groups: category [2]
##
## category term estimate std.error statistic p.value
## (chr) (chr) (dbl) (dbl) (dbl) (dbl)
## 1 l (Intercept) 60.446374 4.7691940 12.674337 3.292914e-35
## 2 l age 3.407357 0.2762010 12.336511 1.603907e-33
## 3 m (Intercept) 42.016774 6.1057150 6.881549 9.128167e-12
## 4 m age 6.491712 0.3195556 20.314815 4.893443e-80
For both ladies and men, age has a significantly positive relationship with season best total score.
bsb_to %>% ggplot(aes(height,score,color=category))+ geom_point()+geom_smooth(method="lm")+
facet_grid(.~category) + ggtitle('Association between Height and Total Score')
bsb_to %>% group_by(category) %>% do(tidy(lm(score~ height,data= .),data= .))
## Source: local data frame [4 x 6]
## Groups: category [2]
##
## category term estimate std.error statistic p.value
## (chr) (chr) (dbl) (dbl) (dbl) (dbl)
## 1 l (Intercept) 231.3744361 24.3311874 9.509377 6.754287e-21
## 2 l height -0.6926333 0.1502604 -4.609552 4.355803e-06
## 3 m (Intercept) 335.5009789 35.9249173 9.338949 4.188183e-20
## 4 m height -0.9836573 0.2072098 -4.747156 2.296034e-06
For both ladies and men, height has a significantly negative relationship with season best total score.
bsb_to %>% ggplot(aes(career_length,score,color=category))+ geom_point()+geom_smooth(method="lm")+
facet_grid(.~category) + ggtitle('Association between Career Length and Total Score')
bsb_to %>% group_by(category) %>% do(tidy(lm(score~ career_length,data= .),data= .))
## Source: local data frame [4 x 6]
## Groups: category [2]
##
## category term estimate std.error statistic p.value
## (chr) (chr) (dbl) (dbl) (dbl) (dbl)
## 1 l (Intercept) 75.478406 2.7697810 27.25068 1.529613e-134
## 2 l career_length 3.717067 0.2242265 16.57729 4.617092e-57
## 3 m (Intercept) 90.861278 3.6268479 25.05241 4.707949e-113
## 4 m career_length 5.666503 0.2633873 21.51396 6.819334e-88
For both ladies and men, career length has a significantly positive relationship with season best total score.
bsb_to %>% ggplot(aes(num_quintile,score,color=category))+
geom_boxplot()+facet_grid(category~.) + ggtitle('Association between Country Rank and Total Score')
bsb_to %>% group_by(category) %>%
do(tidy(lm(score~ num_quintile,data= .),data= .))
## Source: local data frame [10 x 6]
## Groups: category [2]
##
## category term estimate std.error statistic
## (chr) (chr) (dbl) (dbl) (dbl)
## 1 l (Intercept) 89.52679 1.524551 58.723400
## 2 l num_quintile20~40% 13.36064 2.111970 6.326147
## 3 l num_quintile40~60% 18.48902 2.223786 8.314211
## 4 l num_quintile60~80% 44.78198 2.042556 21.924478
## 5 l num_quintile80~100% 62.71311 2.147680 29.200399
## 6 m (Intercept) 132.63691 2.540832 52.202162
## 7 m num_quintile20~40% 11.09490 3.604352 3.078195
## 8 m num_quintile40~60% 14.65078 3.639055 4.025985
## 9 m num_quintile60~80% 54.50751 3.292861 16.553237
## 10 m num_quintile80~100% 65.25258 3.627232 17.989637
## Variables not shown: p.value (dbl)
Compared to skaters from country in the lowest quintile of number of skaters, skaters from country in higher quintile have greater season best total score.
##Multivariate linear regression model for ladies
fit_to_l<- bsb_to %>% filter(category=="l") %>%
lm(score~ age+height+career_length+num_quintile, data= .)
summary(fit_to_l)
##
## Call:
## lm(formula = score ~ age + height + career_length + num_quintile,
## data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -75.225 -17.745 -2.402 16.589 83.630
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 138.3338 18.2505 7.580 5.89e-14 ***
## age 0.7990 0.3791 2.107 0.0352 *
## height -0.5436 0.1114 -4.877 1.18e-06 ***
## career_length 2.5670 0.3231 7.945 3.68e-15 ***
## num_quintile20~40% 11.2466 2.0468 5.495 4.56e-08 ***
## num_quintile40~60% 13.0063 2.1694 5.995 2.52e-09 ***
## num_quintile60~80% 40.2693 1.9659 20.484 < 2e-16 ***
## num_quintile80~100% 56.6217 2.1371 26.495 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.92 on 1570 degrees of freedom
## (91 observations deleted due to missingness)
## Multiple R-squared: 0.5035, Adjusted R-squared: 0.5013
## F-statistic: 227.4 on 7 and 1570 DF, p-value: < 2.2e-16
##Multivariate linear regression model for Men
fit_to_m<- bsb_to %>% filter(category=="m") %>%
lm(score~ age+height+career_length+num_quintile, data= .)
summary(fit_to_m)
##
## Call:
## lm(formula = score ~ age + height + career_length + num_quintile,
## data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -134.556 -21.993 -3.643 18.345 135.851
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 224.1133 26.9699 8.310 2.46e-16 ***
## age 2.7098 0.4818 5.624 2.30e-08 ***
## height -1.0082 0.1526 -6.608 5.72e-11 ***
## career_length 3.0343 0.4061 7.471 1.48e-13 ***
## num_quintile20~40% 6.6950 3.0849 2.170 0.0302 *
## num_quintile40~60% 12.5680 3.1391 4.004 6.60e-05 ***
## num_quintile60~80% 42.8846 2.8393 15.104 < 2e-16 ***
## num_quintile80~100% 54.8501 3.1609 17.353 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33.35 on 1259 degrees of freedom
## (55 observations deleted due to missingness)
## Multiple R-squared: 0.5059, Adjusted R-squared: 0.5031
## F-statistic: 184.1 on 7 and 1259 DF, p-value: < 2.2e-16
In the final multivariate models for men and ladies, each predictor still has statistically significant relationship with season best total score after adjusting the rest of the biography variables. Nearly 50% of the variance of season best total score can be explained by age, height, career length and country rank.
##Comparing the point estimate of coefficients with 95% conficence interval between sex.
fit_to<- bsb_to %>% group_by(category) %>%
do(tidy(lm(score~ age+height+career_length+num_quintile, data= .), data= .))
fit_to<- tbl_df(fit_to)
fit_to %>% filter(term=="age") %>%
mutate(ci_lower=estimate-1.96*std.error,ci_upper=estimate+1.96*std.error) %>%
ggplot(aes(x=term,y=estimate,color=category))+geom_point()+geom_errorbar(aes(ymax=ci_upper,ymin=ci_lower))+ facet_grid(.~category)+coord_fixed(ratio=1.5) +
ggtitle('Estimate based on Age')+geom_hline(yintercept = 0)
fit_to %>% filter(term=="height") %>%
mutate(ci_lower=estimate-1.96*std.error,ci_upper=estimate+1.96*std.error) %>%
ggplot(aes(x=term,y=estimate,color=category))+geom_point()+geom_errorbar(aes(ymax=ci_upper,ymin=ci_lower))+ facet_grid(.~category)+coord_fixed(ratio=1.5) +
ggtitle('Estimate based on Height')+geom_hline(yintercept = 0)
fit_to %>% filter(term=="career_length") %>%
mutate(ci_lower=estimate-1.96*std.error,ci_upper=estimate+1.96*std.error) %>%
ggplot(aes(x=term,y=estimate,color=category))+geom_point()+
geom_errorbar(aes(ymax=ci_upper,ymin=ci_lower))+ facet_grid(.~category)+
coord_fixed(ratio=1.5) + ggtitle('Estimate based on Career Length')+geom_hline(yintercept = 0)
fit_to %>% filter(term%in%c("num_quintile20~40%","num_quintile40~60%",
"num_quintile60~80%","num_quintile80~100%")) %>%
mutate(ci_lower=estimate-1.96*std.error,ci_upper=estimate+1.96*std.error) %>%
ggplot(aes(x=term,y=estimate,color=category))+geom_point()+geom_errorbar(aes(ymax=ci_upper,ymin=ci_lower))+ facet_grid(category~.) + ggtitle('Estimate based on Country Rank')+geom_hline(yintercept = 0)
We can see the magnitude of the associations of age and height with season best total score varies by sex.
bsb_sp<- bsb %>% filter(program_type=="sp")
bsb_sp %>% ggplot(aes(age,score,color=category))+geom_point()+
geom_smooth(method="lm")+facet_grid(.~category)+
ggtitle('Association between Age and SP Score')
bsb_to %>% group_by(category) %>% do(tidy(lm(score~ age,data= .),data= .))
## Source: local data frame [4 x 6]
## Groups: category [2]
##
## category term estimate std.error statistic p.value
## (chr) (chr) (dbl) (dbl) (dbl) (dbl)
## 1 l (Intercept) 60.446374 4.7691940 12.674337 3.292914e-35
## 2 l age 3.407357 0.2762010 12.336511 1.603907e-33
## 3 m (Intercept) 42.016774 6.1057150 6.881549 9.128167e-12
## 4 m age 6.491712 0.3195556 20.314815 4.893443e-80
For both ladies and men, age has a significantly positive relationship with season best score of short program.
bsb_sp %>% ggplot(aes(height,score,color=category))+
geom_point()+geom_smooth(method="lm")+facet_grid(.~category) +
ggtitle('Association between Height and SP score')
bsb_sp %>% group_by(category) %>% do(tidy(lm(score~ height,data= .),data= .))
## Source: local data frame [4 x 6]
## Groups: category [2]
##
## category term estimate std.error statistic p.value
## (chr) (chr) (dbl) (dbl) (dbl) (dbl)
## 1 l (Intercept) 100.7323231 10.38638524 9.698497 2.223014e-21
## 2 l height -0.3536300 0.06415242 -5.512341 4.430405e-08
## 3 m (Intercept) 112.8138506 14.89958200 7.571612 9.135357e-14
## 4 m height -0.3196054 0.08612681 -3.710871 2.192384e-04
For both ladies and men, height has a significantly negative relationship with season best score of short program.
bsb_sp %>% ggplot(aes(career_length,score,color=category))+
geom_point()+geom_smooth(method="lm")+facet_grid(.~category) +
ggtitle('Association between Career Length and SP Score')
bsb_sp %>% group_by(category) %>% do(tidy(lm(score~ career_length,data= .),data= .))
## Source: local data frame [4 x 6]
## Groups: category [2]
##
## category term estimate std.error statistic p.value
## (chr) (chr) (dbl) (dbl) (dbl) (dbl)
## 1 l (Intercept) 28.412361 1.2620557 22.51276 4.228408e-92
## 2 l career_length 1.286085 0.1024824 12.54933 8.685723e-34
## 3 m (Intercept) 32.760015 1.5197246 21.55655 2.336999e-83
## 4 m career_length 1.887580 0.1097136 17.20461 1.545787e-57
For both ladies and men, career length has a significantly positive relationship with season best score of short program.
bsb_sp %>% ggplot(aes(num_quintile,score,color=category))+
geom_boxplot()+facet_grid(category~.) + ggtitle('Association between Country Rank and SP Score')
bsb_sp %>% group_by(category) %>%
do(tidy(lm(score~ num_quintile,data= .),data= .))
## Source: local data frame [10 x 6]
## Groups: category [2]
##
## category term estimate std.error statistic
## (chr) (chr) (dbl) (dbl) (dbl)
## 1 l (Intercept) 33.123612 0.6447213 51.376633
## 2 l num_quintile20~40% 4.701086 0.9068478 5.183986
## 3 l num_quintile40~60% 6.270518 0.9635717 6.507578
## 4 l num_quintile60~80% 15.303851 0.8732485 17.525195
## 5 l num_quintile80~100% 22.731975 0.9266348 24.531752
## 6 m (Intercept) 46.767396 1.0066236 46.459666
## 7 m num_quintile20~40% 3.993733 1.4350153 2.783060
## 8 m num_quintile40~60% 5.317126 1.5008248 3.542803
## 9 m num_quintile60~80% 19.187754 1.3595138 14.113689
## 10 m num_quintile80~100% 21.754738 1.4830991 14.668432
## Variables not shown: p.value (dbl)
Compared to skaters from country in the lowest quintile of number of skaters, skaters from country in higher quintile have greater season best score of short program.
##Multivariate linear regression model for ladies
fit_sp_l<- bsb_sp %>% filter(category=="l") %>%
lm(score~ age+height+career_length+num_quintile, data= .)
summary(fit_sp_l)
##
## Call:
## lm(formula = score ~ age + height + career_length + num_quintile,
## data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.9226 -6.3346 -0.3698 6.1869 26.1133
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63.96516 7.83919 8.160 9.53e-16 ***
## age 0.19328 0.16087 1.201 0.23
## height -0.27428 0.04812 -5.700 1.55e-08 ***
## career_length 1.01311 0.13975 7.249 8.11e-13 ***
## num_quintile20~40% 3.98783 0.88314 4.515 7.03e-06 ***
## num_quintile40~60% 4.30449 0.93948 4.582 5.16e-06 ***
## num_quintile60~80% 14.08897 0.84310 16.711 < 2e-16 ***
## num_quintile80~100% 20.30167 0.92630 21.917 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.883 on 1051 degrees of freedom
## (69 observations deleted due to missingness)
## Multiple R-squared: 0.5088, Adjusted R-squared: 0.5055
## F-statistic: 155.5 on 7 and 1051 DF, p-value: < 2.2e-16
##Multivariate linear regression model for Men
fit_sp_m<- bsb_sp %>% filter(category=="m") %>%
lm(score~ age+height+career_length+num_quintile, data= .)
summary(fit_sp_m)
##
## Call:
## lm(formula = score ~ age + height + career_length + num_quintile,
## data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.764 -8.130 -0.810 6.791 41.601
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 82.16953 11.25543 7.300 6.40e-13 ***
## age 0.99115 0.19653 5.043 5.56e-07 ***
## height -0.37479 0.06457 -5.804 9.02e-09 ***
## career_length 0.99938 0.16907 5.911 4.85e-09 ***
## num_quintile20~40% 2.63938 1.23667 2.134 0.033095 *
## num_quintile40~60% 4.30340 1.30271 3.303 0.000994 ***
## num_quintile60~80% 15.46956 1.16964 13.226 < 2e-16 ***
## num_quintile80~100% 17.98100 1.28734 13.968 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.68 on 883 degrees of freedom
## (41 observations deleted due to missingness)
## Multiple R-squared: 0.4937, Adjusted R-squared: 0.4897
## F-statistic: 123 on 7 and 883 DF, p-value: < 2.2e-16
In the final multivariate models for men and ladies,nearly all the predictors still have statistically significant relationship with season best score of short program after adjusting the rest of the biography variables, except for the association of age with the outcome among ladies. Around 50% of the variance of season best score of short program can be explained by age, height, career length and country rank.
##Comparing the point estimate of coefficients with 95% conficence interval between sex.
fit_sp<- bsb_sp %>% group_by(category) %>%
do(tidy(lm(score~ age+height+career_length+num_quintile,data= .),data=.))
fit_sp<- tbl_df(fit_sp)
fit_sp %>% filter(term=="age") %>%
mutate(ci_lower=estimate-1.96*std.error,ci_upper=estimate+1.96*std.error) %>%
ggplot(aes(x=term,y=estimate,color=category))+geom_point()+geom_errorbar(aes(ymax=ci_upper,ymin=ci_lower))+ facet_grid(.~category)+coord_fixed(ratio=1.5) +
ggtitle('SP Estimate based on Age')+geom_hline(yintercept = 0)
fit_sp %>% filter(term=="height") %>%
mutate(ci_lower=estimate-1.96*std.error,ci_upper=estimate+1.96*std.error) %>%
ggplot(aes(x=term,y=estimate,color=category))+geom_point()+geom_errorbar(aes(ymax=ci_upper,ymin=ci_lower))+ facet_grid(.~category)+coord_fixed(ratio=1.5) +
ggtitle('SP Estimate based on Height')+geom_hline(yintercept = 0)
fit_sp %>% filter(term=="career_length") %>%
mutate(ci_lower=estimate-1.96*std.error,ci_upper=estimate+1.96*std.error) %>%
ggplot(aes(x=term,y=estimate,color=category))+geom_point()+
geom_errorbar(aes(ymax=ci_upper,ymin=ci_lower))+ facet_grid(.~category)+
coord_fixed(ratio=1.5) + ggtitle('SP Estimate based on Career Length')+geom_hline(yintercept = 0)
fit_sp %>% filter(term%in%c("num_quintile20~40%","num_quintile40~60%",
"num_quintile60~80%","num_quintile80~100%")) %>%
mutate(ci_lower=estimate-1.96*std.error,ci_upper=estimate+1.96*std.error) %>%
ggplot(aes(x=term,y=estimate,color=category))+geom_point()+geom_errorbar(aes(ymax=ci_upper,ymin=ci_lower))+ facet_grid(category~.) +
ggtitle('SP Estimate based on Country Rank')+geom_hline(yintercept = 0)
We can see the magnitude of the associations of age with season best score of short program varies by sex.
bsb_fs<- bsb %>% filter(program_type=="fs")
bsb_fs %>% ggplot(aes(age,score,color=category))+
geom_point()+geom_smooth(method="lm")+facet_grid(.~category)+
ggtitle('Association between Age and FS Score')
bsb_fs %>% group_by(category) %>% do(tidy(lm(score~ age,data= .),data= .))
## Source: local data frame [4 x 6]
## Groups: category [2]
##
## category term estimate std.error statistic p.value
## (chr) (chr) (dbl) (dbl) (dbl) (dbl)
## 1 l (Intercept) 46.013022 4.2370918 10.859576 3.719303e-26
## 2 l age 1.941041 0.2459612 7.891654 7.228734e-15
## 3 m (Intercept) 30.935479 5.0415615 6.136091 1.273917e-09
## 4 m age 4.254407 0.2640957 16.109342 2.115192e-51
For both ladies and men, age has a significantly positive relationship with season best score of free skating.
bsb_fs %>% ggplot(aes(height,score,color=category))+
geom_point()+geom_smooth(method="lm")+facet_grid(.~category)+
ggtitle('Association between Height and FS Score')
bsb_fs %>% group_by(category) %>% do(tidy(lm(score~ height,data= .),data= .))
## Source: local data frame [4 x 6]
## Groups: category [2]
##
## category term estimate std.error statistic p.value
## (chr) (chr) (dbl) (dbl) (dbl) (dbl)
## 1 l (Intercept) 192.5054564 20.6228505 9.334571 5.919500e-20
## 2 l height -0.6972585 0.1273909 -5.473379 5.529755e-08
## 3 m (Intercept) 223.9802303 30.5294436 7.336532 5.080468e-13
## 4 m height -0.6503897 0.1765796 -3.683265 2.446831e-04
For both ladies and men, height has a significantly negative relationship with season best score of free skating.
bsb_fs %>% ggplot(aes(career_length,score,color=category))+
geom_point()+geom_smooth(method="lm")+facet_grid(.~category) +
ggtitle('Association between Career Length and FS Score')
bsb_fs %>% group_by(category) %>% do(tidy(lm(score~ career_length,data= .),data= .))
## Source: local data frame [4 x 6]
## Groups: category [2]
##
## category term estimate std.error statistic p.value
## (chr) (chr) (dbl) (dbl) (dbl) (dbl)
## 1 l (Intercept) 50.844946 2.4950782 20.37810 8.000505e-78
## 2 l career_length 2.469671 0.2037479 12.12121 1.047491e-31
## 3 m (Intercept) 62.722539 3.0637166 20.47270 3.957547e-76
## 4 m career_length 3.760502 0.2230866 16.85669 2.851872e-55
For both ladies and men, career length has a significantly positive relationship with season best score of free skating.
bsb_fs %>% ggplot(aes(num_quintile,score,color=category))+
geom_boxplot()+facet_grid(category~.) +
ggtitle('Association between Country Rank and FS Score')
bsb_fs %>% group_by(category) %>%
do(tidy(lm(score~ num_quintile,data= .),data= .))
## Source: local data frame [10 x 6]
## Groups: category [2]
##
## category term estimate std.error statistic
## (chr) (chr) (dbl) (dbl) (dbl)
## 1 l (Intercept) 58.615674 1.285548 45.595857
## 2 l num_quintile20~40% 10.145326 1.807680 5.612344
## 3 l num_quintile40~60% 12.380370 1.898663 6.520572
## 4 l num_quintile60~80% 30.892129 1.731626 17.839951
## 5 l num_quintile80~100% 44.780335 1.822302 24.573492
## 6 m (Intercept) 89.676889 2.066315 43.399420
## 7 m num_quintile20~40% 7.285817 2.964873 2.457379
## 8 m num_quintile40~60% 10.355016 3.081852 3.359998
## 9 m num_quintile60~80% 37.971633 2.758828 13.763683
## 10 m num_quintile80~100% 43.900780 2.997436 14.646113
## Variables not shown: p.value (dbl)
Compared to skaters from country in the lowest quintile of number of skaters, skaters from country in higher quintile have greater season best score of short program.
##Multivariate linear regression model for ladies
fit_fs_l<- bsb_fs %>% filter(category=="l") %>%
lm(score~ age+height+career_length+num_quintile, data= .)
summary(fit_fs_l)
##
## Call:
## lm(formula = score ~ age + height + career_length + num_quintile,
## data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.386 -12.310 -0.954 11.644 58.737
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 116.57262 15.55604 7.494 1.45e-13 ***
## age 0.43612 0.32042 1.361 0.174
## height -0.51769 0.09557 -5.417 7.56e-08 ***
## career_length 1.83409 0.27792 6.599 6.64e-11 ***
## num_quintile20~40% 9.18052 1.77569 5.170 2.82e-07 ***
## num_quintile40~60% 8.68480 1.86914 4.646 3.82e-06 ***
## num_quintile60~80% 28.74391 1.69065 17.002 < 2e-16 ***
## num_quintile80~100% 40.33783 1.84811 21.826 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.39 on 1017 degrees of freedom
## (69 observations deleted due to missingness)
## Multiple R-squared: 0.5115, Adjusted R-squared: 0.5081
## F-statistic: 152.1 on 7 and 1017 DF, p-value: < 2.2e-16
##Multivariate linear regression model for men
fit_fs_m<- bsb_fs %>% filter(category=="m") %>%
lm(score~ age+height+career_length+num_quintile, data= .)
summary(fit_fs_m)
##
## Call:
## lm(formula = score ~ age + height + career_length + num_quintile,
## data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -59.181 -15.812 -3.602 13.092 94.988
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 162.2113 23.0838 7.027 4.35e-12 ***
## age 2.0258 0.4052 5.000 6.97e-07 ***
## height -0.7537 0.1324 -5.694 1.72e-08 ***
## career_length 1.8892 0.3478 5.432 7.31e-08 ***
## num_quintile20~40% 5.0810 2.5744 1.974 0.04875 *
## num_quintile40~60% 8.7599 2.6930 3.253 0.00119 **
## num_quintile60~80% 30.4064 2.3947 12.697 < 2e-16 ***
## num_quintile80~100% 36.1535 2.6269 13.763 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.38 on 843 degrees of freedom
## (39 observations deleted due to missingness)
## Multiple R-squared: 0.4926, Adjusted R-squared: 0.4884
## F-statistic: 116.9 on 7 and 843 DF, p-value: < 2.2e-16
In the final multivariate models for men and ladies,nearly all the predictors still have statistically significant relationship with season best score of free skating after adjusting the rest of the biography variables, except for the association of age with the outcome among ladies. Around 50% of the variance of season best score of free skating can be explained by age, height, career length and country rank.
##Comparing the point estimate of coefficients with 95% conficence interval between sex.
fit_fs<- bsb_fs %>% group_by(category) %>%
do(tidy(lm(score~ age+height+career_length+num_quintile,data= .),data=.))
fit_fs<- tbl_df(fit_fs)
fit_fs %>% filter(term=="age") %>%
mutate(ci_lower=estimate-1.96*std.error,ci_upper=estimate+1.96*std.error) %>%
ggplot(aes(x=term,y=estimate,color=category))+geom_point()+geom_errorbar(aes(ymax=ci_upper,ymin=ci_lower))+ facet_grid(.~category)+coord_fixed(ratio=1.5) +
ggtitle('FS Estimated based on Age')+geom_hline(yintercept = 0)
fit_fs %>% filter(term=="height") %>%
mutate(ci_lower=estimate-1.96*std.error,ci_upper=estimate+1.96*std.error) %>%
ggplot(aes(x=term,y=estimate,color=category))+geom_point()+geom_errorbar(aes(ymax=ci_upper,ymin=ci_lower))+ facet_grid(.~category)+coord_fixed(ratio=1.5) +
ggtitle('FS Estimated based on Height')+geom_hline(yintercept = 0)
fit_fs %>% filter(term=="career_length") %>%
mutate(ci_lower=estimate-1.96*std.error,ci_upper=estimate+1.96*std.error) %>%
ggplot(aes(x=term,y=estimate,color=category))+geom_point()+
geom_errorbar(aes(ymax=ci_upper,ymin=ci_lower))+
facet_grid(.~category)+coord_fixed(ratio=1.5) +
ggtitle('FS Estimated based on Career Length')+geom_hline(yintercept = 0)
fit_fs %>% filter(term%in%c("num_quintile20~40%","num_quintile40~60%",
"num_quintile60~80%","num_quintile80~100%")) %>%
mutate(ci_lower=estimate-1.96*std.error,ci_upper=estimate+1.96*std.error) %>%
ggplot(aes(x=term,y=estimate,color=category))+geom_point()+geom_errorbar(aes(ymax=ci_upper,ymin=ci_lower))+ facet_grid(category~.) +
ggtitle('FS Estimated based on Country Rank')+geom_hline(yintercept = 0)
We can see the magnitude of the associations of age with season best score of free skating varies by sex.
seb<- single_event_bio
seb %>% filter(is.na(name)==TRUE) %>% summarise(n())
## Source: local data frame [1 x 1]
##
## n()
## (int)
## 1 142
outof1392singles, there are 142 missing biography info.
#Remove observations with missing biography
seb<- seb %>% filter(is.na(name)==FALSE)
In this part, we ran linear regression to explore the associations between biography variables and individual event total scores for male and female skaters.
seb %>% ggplot(aes(age,total,color=category))+geom_point()+
geom_smooth(method="lm")+facet_grid(.~category) +
ggtitle('Association between Age and Event Total Score')
seb %>% group_by(category) %>% do(tidy(lm(total~ age,data= .),data= .))
## Source: local data frame [4 x 6]
## Groups: category [2]
##
## category term estimate std.error statistic p.value
## (chr) (chr) (dbl) (dbl) (dbl) (dbl)
## 1 Ladies (Intercept) 142.8282261 7.9832861 17.89090671 1.853314e-57
## 2 Ladies age 0.0125651 0.3709298 0.03387461 9.729887e-01
## 3 Men (Intercept) 156.7544514 11.3893093 13.76329738 4.922588e-38
## 4 Men age 1.4914508 0.4893586 3.04776627 2.398448e-03
Age has a significantly positive relationship with event total score for men only.
seb %>% ggplot(aes(height,total,color=category))+
geom_point()+geom_smooth(method="lm")+facet_grid(.~category) +
ggtitle('Association between Height and Event Total Score')
seb %>% group_by(category) %>% do(tidy(lm(total~ height,data= .),data= .))
## Source: local data frame [4 x 6]
## Groups: category [2]
##
## category term estimate std.error statistic p.value
## (chr) (chr) (dbl) (dbl) (dbl) (dbl)
## 1 Ladies (Intercept) 74.1585103 42.3181108 1.752406 8.023151e-02
## 2 Ladies height 0.4254202 0.2607342 1.631624 1.033006e-01
## 3 Men (Intercept) 357.9579385 45.6126343 7.847780 1.740790e-14
## 4 Men height -0.9596096 0.2625566 -3.654868 2.780228e-04
Height has a significantly negative relationship with event total score for men.
seb %>% ggplot(aes(career_length,total,color=category))+
geom_point()+geom_smooth(method="lm")+
facet_grid(.~category) +
ggtitle('Association between Career Length and Event Total Score')
seb %>% group_by(category) %>% do(tidy(lm(total~ career_length,data= .),data= .))
## Source: local data frame [4 x 6]
## Groups: category [2]
##
## category term estimate std.error statistic p.value
## (chr) (chr) (dbl) (dbl) (dbl) (dbl)
## 1 Ladies (Intercept) 131.675512 5.2065579 25.290319 1.400202e-95
## 2 Ladies career_length 0.765271 0.3186156 2.401863 1.662641e-02
## 3 Men (Intercept) 145.342432 6.5874661 22.063481 6.849744e-81
## 4 Men career_length 2.717931 0.3776833 7.196323 1.716764e-12
For both ladies and men, career length has a significantly positive relationship with event total score.
seb %>% ggplot(aes(num_quintile,total,color=category))+
geom_boxplot()+facet_grid(category~.) +
ggtitle('Association between Country Rank and Event Total Score')
seb %>% group_by(category) %>%
do(tidy(lm(total~ num_quintile,data= .),data= .))
## Source: local data frame [10 x 6]
## Groups: category [2]
##
## category term estimate std.error statistic
## (chr) (chr) (dbl) (dbl) (dbl)
## 1 Ladies (Intercept) 116.585946 2.966308 39.303382
## 2 Ladies num_quintile20~40% 14.371673 4.068227 3.532663
## 3 Ladies num_quintile40~60% 6.751807 4.014349 1.681918
## 4 Ladies num_quintile60~80% 35.705840 3.481530 10.255790
## 5 Ladies num_quintile80~100% 46.626657 3.641256 12.805103
## 6 Men (Intercept) 166.613208 3.367635 49.474836
## 7 Men num_quintile20~40% 6.956973 4.708618 1.477498
## 8 Men num_quintile40~60% 10.805792 4.719061 2.289818
## 9 Men num_quintile60~80% 42.314174 4.062066 10.416910
## 10 Men num_quintile80~100% 44.565022 4.885006 9.122818
## Variables not shown: p.value (dbl)
Compared to skaters from country in the lowest quintile of number of skaters, skaters from country in higher quintile have greater event total score.
##Multivariate linear regression model for ladies
fit_l<- seb %>% filter(category=="Ladies") %>%
lm(total~ age+height+career_length+num_quintile, data= .)
summary(fit_l)
##
## Call:
## lm(formula = total ~ age + height + career_length + num_quintile,
## data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -72.35 -17.82 1.19 17.35 62.17
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.0806 34.9298 1.319 0.18762
## age -1.4905 0.5829 -2.557 0.01082 *
## height 0.4129 0.2123 1.945 0.05229 .
## career_length 2.3060 0.4961 4.649 4.16e-06 ***
## num_quintile20~40% 12.6395 3.9253 3.220 0.00136 **
## num_quintile40~60% 4.9267 3.9492 1.248 0.21273
## num_quintile60~80% 36.6077 3.3582 10.901 < 2e-16 ***
## num_quintile80~100% 45.9865 3.6088 12.743 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.29 on 566 degrees of freedom
## (18 observations deleted due to missingness)
## Multiple R-squared: 0.3689, Adjusted R-squared: 0.361
## F-statistic: 47.25 on 7 and 566 DF, p-value: < 2.2e-16
##Multivariate linear regression model for men
fit_m<- seb %>% filter(category=="Men") %>%
lm(total~ age+height+career_length+num_quintile, data= .)
summary(fit_m)
##
## Call:
## lm(formula = total ~ age + height + career_length + num_quintile,
## data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -95.989 -22.242 -1.045 22.180 117.290
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 284.9161 42.9873 6.628 7.23e-11 ***
## age -2.0043 0.6646 -3.016 0.00267 **
## height -0.7091 0.2384 -2.974 0.00305 **
## career_length 3.1544 0.5307 5.944 4.56e-09 ***
## num_quintile20~40% 5.9951 4.5885 1.307 0.19184
## num_quintile40~60% 11.3235 4.6101 2.456 0.01430 *
## num_quintile60~80% 36.9621 4.0160 9.204 < 2e-16 ***
## num_quintile80~100% 42.4191 4.7499 8.930 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33.4 on 641 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.2838, Adjusted R-squared: 0.276
## F-statistic: 36.28 on 7 and 641 DF, p-value: < 2.2e-16
In the final multivariate model for ladies, the association between age and event total score became statistically significant after adjusting the rest of the biography variables. These biography variables account for 36% of the variance of event total score, which is less than the 50% for explaining season best scores.
In the final multivariate model for men, the associations of biography variables with event total score changed slightly after adjusting the rest of the biography variables, except for age became negatively associated with outcome. However, these biography variables can account only 28% of the variance of event total score for men, which is 8% less than the proportion explained by biography variables of ladies and is 20% less than the proportion for explaining season best scores among men.
##Comparing the point estimate of coefficients with 95% conficence interval between sex.
fit<- seb %>% group_by(category) %>%
do(tidy(lm(total~ age+height+career_length+num_quintile,data= .),data=.))
fit<- tbl_df(fit)
fit %>% filter(term=="age") %>%
mutate(ci_lower=estimate-1.96*std.error,ci_upper=estimate+1.96*std.error) %>%
ggplot(aes(x=term,y=estimate,color=category))+geom_point()+geom_errorbar(aes(ymax=ci_upper,ymin=ci_lower))+ facet_grid(.~category)+coord_fixed(ratio=1.5) +
ggtitle('Estimate of Event Total Score by Age')+geom_hline(yintercept = 0)
##After adjusting for height,career_length and num_quintile, the association of age with total score became negative.
fit %>% filter(term=="height") %>%
mutate(ci_lower=estimate-1.96*std.error,ci_upper=estimate+1.96*std.error) %>%
ggplot(aes(x=term,y=estimate,color=category))+geom_point()+geom_errorbar(aes(ymax=ci_upper,ymin=ci_lower))+ facet_grid(.~category)+coord_fixed(ratio=1.5) +
ggtitle('Estimate of Event Total Score by Height')+geom_hline(yintercept = 0)
##After adjusting for height,career_length and num_quintile, the association of height with total score isn't statistically significant.
fit %>% filter(term=="career_length") %>%
mutate(ci_lower=estimate-1.96*std.error,ci_upper=estimate+1.96*std.error) %>%
ggplot(aes(x=term,y=estimate,color=category))+geom_point()+
geom_errorbar(aes(ymax=ci_upper,ymin=ci_lower))+ facet_grid(.~category)+
coord_fixed(ratio=1.5) +
ggtitle('Estimate of Event Total Score by Career Length')+geom_hline(yintercept = 0)
##After adjusting for height,career_length and num_quintile, the positive association of career length with total score is statistically significant.
fit %>% filter(term%in%c("num_quintile20~40%","num_quintile40~60%",
"num_quintile60~80%","num_quintile80~100%")) %>%
mutate(ci_lower=estimate-1.96*std.error,ci_upper=estimate+1.96*std.error) %>%
ggplot(aes(x=term,y=estimate,color=category))+geom_point()+geom_errorbar(aes(ymax=ci_upper,ymin=ci_lower))+ facet_grid(category~.) +
ggtitle('Estimate of Event Total Score by Country Rank')+geom_hline(yintercept = 0)
We can see the opposite direction of the associations of height with event total score comparing ladies to men. Taller skaters tend to score higher among ladies, however, shorter skaters are more likely to score higher among men, holding other variables constant. It is also interesting the note that holding career length constant, younger skaters tend to score higher for both ladies and men.
We predicted Season Best Scores (SBS) from different machine learning methods.
Instead of predicting exact SBS, we would like to keep it simple by separating SBS into quartiles based on total score from 2015. Q1 is the highest scores while Q4 is the lowest. Our goal was be to predict 2015 quartile of Season Best Total Score correctly by using past scores and biography data.
Due to sample size, we focused only on single men and single ladies. We looked at the predicitive models for men and ladies separately because their SBS distributions were very different, as discovered during exploratory analysis.
The machine learning methods we examined were kmean, knn and LDA. For KNN and LDA, we trained with 80% of the data set and used 20% of the data as the test set. We evaluated the accuracy of the method by calculating percentage of data with predicted quartile, same as actual quartile for 2015 SBS. We applied 100 replications for each method.
library(cowplot)
library(stringr)
library(caret)
library(dendextend)
First we reformated the table.
cleaned.df <- seasonbestsingle_bio %>%
mutate(UID = paste(name,category, sep="_")) %>%
mutate(UID = paste(UID,program_type, sep="_")) %>%
filter(!is.na(name))
scorespread.df <-cleaned.df %>%
filter(program_type == "to") %>%
select(UID, score, season_start_year) %>%
mutate(season_start_year = paste0("yr_",as.character(season_start_year))) %>%
#select(name, category, score, season_start_year) %>%
spread(season_start_year, score) %>%
filter(!is.na(yr_2015))
info.df <- cleaned.df %>%
filter(season_start_year == 2015) %>%
mutate(countryQuintile_by_skaterNum = num_quintile) %>%
select(UID, first_name, last_name, age, career_length,countryQuintile_by_skaterNum, height, country, category, name) %>%
distinct()
labels<- c("4th Q","3rd Q","2nd Q", "1st Q")
scorespread.m.df <-
scorespread.df %>%
filter(str_sub(UID, start= -4) == "m_to") %>%
mutate(SBS_2015_group =cut(yr_2015,
breaks=quantile(yr_2015,prob=seq(0,1,0.25)),
include.lowest=TRUE,
labels=labels))
scorespread.l.df <-
scorespread.df %>%
filter(str_sub(UID, start= -4) == "l_to") %>%
mutate(SBS_2015_group =cut(yr_2015,
breaks=quantile(yr_2015,prob=seq(0,1,0.25)),
include.lowest=TRUE,
labels=labels))
scorespread.m.df <- scorespread.m.df %>%
left_join(info.df, by ="UID")
scorespread.l.df <- scorespread.l.df %>%
left_join(info.df, by ="UID")
scorespread.df<-bind_rows(scorespread.m.df,scorespread.l.df)
rm(cleaned.df)
We then created a matrix with data to use for different machine learning methods.
as.numeric.factor <- function(x) {seq_along(levels(x))[x]}
formatMatrix <- function(df){
X <-
df %>%
select(-first_name,-last_name, -name, - SBS_2015_group, -yr_2015, -country, -UID) %>%
mutate(category = as.factor(category))
X[is.na(X)] <- 0
rownames(X) <-df$UID
return (X)
}
data.m.df <-formatMatrix(scorespread.m.df)
data.l.df <-formatMatrix(scorespread.l.df)
#X <- X %>%
#mutate(category = as.numeric.factor(category)) %>%
#mutate(countryQuintile_by_skaterNum = as.numeric.factor(countryQuintile_by_skaterNum))
#hierarchical clustering
## function to set label color
labelCol <- function(x) {
if (is.leaf(x)) {
## fetch label
label <- attr(x, "label")
code <- substr(label, 1, 1)
## use the following line to reset the label to one letter code
# attr(x, "label") <- code
attr(x, "nodePar") <- list(lab.col=colorCodes[code])
}
return(x)
}
colorCodes <- c(A="red", B="green", C="blue", D="yellow")
for( i in 1:2){
if (i == 1) {
data <-data.m.df
rownames(data)<-str_replace(rownames(data), "_m_to","")
title <- "Hierarchical clustering for men"
namesource.df <-scorespread.m.df
}
else {
data <-data.l.df
rownames(data)<-str_replace(rownames(data), "_l_to","")
title <- "Hierarchical clustering for ladies"
namesource.df <-scorespread.l.df
}
data <- dist(data, method = c("euclidian"))
hcdata <- hclust(data, method = c("average"))
## apply labelCol on all nodes of the dendrogram
dend <- as.dendrogram(hcdata)
name.df <- data.frame(labels(dend))
colnames(name.df)[1] <-"name"
code <- name.df %>%
left_join(namesource.df, by = "name") %>%
select(SBS_2015_group)
code <- code$SBS_2015_group
code <- str_replace(code, "1st Q","red")
code <- str_replace(code, "2nd Q","green")
code <- str_replace(code, "3rd Q","blue")
code <- str_replace(code, "4th Q","black")
dend %>%
set("labels_cex", 0.4) %>%
set("labels_colors", code) %>%
plot(main = title)
}
Due to the number of samples, it is very hard to read the label, so we colored the label by SBS 2015 quartile . Each label is a name of a skater. Q1 skaters are in red, Q2 skaters are in green, Q3 skaters are in blue, and Q4 skaters are in black.
From hierarchical clustering plot, for men we can see that Q1 skaters are grouped together at the far left corner and far right corner. Q4 skaters are in the middle. The separation between Q2 and Q3 are less obvious.
For ladies, Q1 skaters are grouped together at the far left corner and far right corner. The separation between Q2, Q3, and Q4 are less obvious.
Overall, hierarchical clustering identified the top performers of 2015 well, but did not identify other quartiles as clearly.
#men
data <- data.m.df %>%
mutate(category = as.numeric.factor(category)) %>%
mutate(countryQuintile_by_skaterNum = as.numeric.factor(countryQuintile_by_skaterNum))
pc.cr.m = prcomp(data)
summary(pc.cr.m)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5
## Standard deviation 177.1683 82.7412 60.97076 47.41208 36.10341
## Proportion of Variance 0.6441 0.1405 0.07628 0.04613 0.02675
## Cumulative Proportion 0.6441 0.7846 0.86087 0.90700 0.93375
## PC6 PC7 PC8 PC9 PC10 PC11
## Standard deviation 35.46903 33.04967 29.22991 4.18670 2.1605 1.30981
## Proportion of Variance 0.02582 0.02241 0.01753 0.00036 0.0001 0.00004
## Cumulative Proportion 0.95956 0.98198 0.99951 0.99987 1.0000 1.00000
## PC12
## Standard deviation 0
## Proportion of Variance 0
## Cumulative Proportion 1
#ladies
data <- data.l.df %>%
mutate(category = as.numeric.factor(category)) %>%
mutate(countryQuintile_by_skaterNum = as.numeric.factor(countryQuintile_by_skaterNum))
pc.cr.l = prcomp(data)
summary(pc.cr.l)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5
## Standard deviation 111.8550 59.4855 41.51743 36.62834 34.04632
## Proportion of Variance 0.5787 0.1637 0.07973 0.06206 0.05362
## Cumulative Proportion 0.5787 0.7424 0.82215 0.88421 0.93782
## PC6 PC7 PC8 PC9 PC10 PC11
## Standard deviation 26.51272 22.3965 10.88794 3.8908 2.05185 1.31120
## Proportion of Variance 0.03251 0.0232 0.00548 0.0007 0.00019 0.00008
## Cumulative Proportion 0.97034 0.9935 0.99903 0.9997 0.99992 1.00000
## PC12
## Standard deviation 0
## Proportion of Variance 0
## Cumulative Proportion 1
From PCA summary, for men, a total of 78.46% of the variation in the data is captured in the first two principle components. If we use the first 4 PCA components, we can capture 90.70% of variation in the data. For ladies, a total of 74.24% of the variation in the data is captured in the first two principle components. If we use the first 4 PCA components, we can capture 88.42% of the variation in the data.
We created a helper function below to calculate accuracy for k-mean clustering.
calculate.confusion <- function(states, clusters)
{
# generate a confusion matrix of cols C versus states S
d <- data.frame(state = states, cluster = clusters)
td <- as.data.frame(table(d))
# convert from raw counts to percentage of each label
pc <- matrix(ncol=max(clusters),nrow=0) # k cols
for (i in 1:4) # 4 labels
{
#i <-1
total <- sum(td[td$state==td$state[i],3])
pc <- rbind(pc, td[td$state==td$state[i],3]/total)
}
rownames(pc) <- td[1:4,1]
return(pc)
}
assign.cluster.labels <- function(cm, k)
{
# take the cluster label from the highest percentage in that column
cluster.labels <- list()
for (i in 1:k)
{
cluster.labels <- rbind(cluster.labels, row.names(cm)[match(max(cm[,i]), cm[,i])])
}
return(cluster.labels)
}
calculate.accuracy <- function(states, clabels)
{
matching <- Map(function(state, labels) { state %in% labels }, states, clabels)
tf <- unlist(matching, use.names=FALSE)
return (sum(tf)/length(tf))
}
getKMeanAccuracy<- function(data, y, k){
cl <- kmeans(data, centers=k)
conf.mat <- calculate.confusion(y, cl$cluster)
cluster.labels <- assign.cluster.labels(conf.mat,k)
acc <- calculate.accuracy(y, cluster.labels[cl$cluster])
return (acc)
}
accuracy.kmean.df <-
data.frame(
category = character(),
k = integer(),
PCA = integer(),
accuracy = double(),
sd = double(),
method = character()
)
for (i in 1:2) {
if (i == 1) {
pc.cr <- pc.cr.m
title.add <- "for men"
SBS_2015_group <-scorespread.m.df$SBS_2015_group
category <- "m"
}
else {
pc.cr <- pc.cr.l
title.add <- "for ladies"
SBS_2015_group <-scorespread.l.df$SBS_2015_group
category <- "l"
}
data <-pc.cr$x[,1:2]
pca1 <-pc.cr$x[,1]
pca2 <-pc.cr$x[,2]
# find optimal number of cluster for k means
wss <- (nrow(data)-1)*sum(apply(data,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(data,
centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares", main = paste("optimal k-mean cluster with first 2 PCA", title.add))
data <-pc.cr$x[,1:4]
wss <- (nrow(data)-1)*sum(apply(data,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(data,
centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares", main = paste("optimal k-mean cluster with first 4 PCA", title.add))
pc.df<-data.frame(PCA1=pca1, PCA2=pca2, SBS_2015_group=factor(SBS_2015_group))
print(ggplot(pc.df, aes(x=PCA1, y=PCA2, color=SBS_2015_group)) +
geom_point() + ggtitle(paste("SBS of 2015 using first 2 PCA",title.add)))
data <-pc.cr$x[,1:2]
cl <- kmeans(data, centers=4)
pc.df<-data.frame(PCA1=pca1, PCA2=pca2, Cluster=factor(cl$cluster))
print(ggplot(pc.df, aes(x=PCA1, y=PCA2, color=Cluster)) +
geom_point() + ggtitle(paste("kmean using first 2 PCA",title.add)))
data <-pc.cr$x[,1:4]
cl <- kmeans(data, centers=4)
pc.df<-data.frame(PCA1=pca1, PCA2=pca2, Cluster=factor(cl$cluster))
print(ggplot(pc.df, aes(x=PCA1, y=PCA2, color=Cluster)) +
geom_point() + ggtitle(paste("kmean using first 4 PCA",title.add)))
for (p in 2:6){
data <-pc.cr$x[,1:p]
for (k in 4:10){
accuracy.kmean<-replicate(100,getKMeanAccuracy(data, SBS_2015_group, k))
new.df <- data.frame(category = category, k =k , PCA = p,accuracy= mean(accuracy.kmean),sd= sd(accuracy.kmean),method = "kmean")
accuracy.kmean.df <- bind_rows(accuracy.kmean.df, new.df)
}
#data <-pc.cr$x[,1:2]
#accuracy.kmean.pca2<-replicate(100,getKMeanAccuracy(data, SBS_2015_group, k))
#cat("For cluster =", k, ", kmean with using first 2 PCA acheive accuracy with mean = ", mean(accuracy.kmean.pca2), ", sd =",sd(accuracy.kmean.pca2), " ",title.add, "\n")
#data <-pc.cr$x[,1:4]
#accuracy.kmean.pca4<-replicate(100,getKMeanAccuracy(data, SBS_2015_group, k))
#cat("For cluster =", k, ", kmean with using first 4 PCA acheive accuracy with mean = ", mean(accuracy.kmean.pca4), ", sd =",sd(accuracy.kmean.pca4), " ",title.add, "\n")
}
}
accuracy.kmean.df%>%
ggplot(aes(x=k,y=accuracy))+
geom_boxplot(aes( group = k)) +
xlab("k")+
ggtitle("Kmean accuray")+
facet_wrap(~ category, scale = "free")
accuracy.kmean.df%>%
ggplot(aes(x=PCA,y=accuracy))+
geom_boxplot(aes( group = PCA)) +
xlab("PCA")+
ggtitle("Kmean accuray")+
facet_wrap(~ category, scale = "free")
do.call(rbind,lapply(split(accuracy.kmean.df,accuracy.kmean.df$category),function(chunk) chunk[which.max(chunk$accuracy),]))
## Source: local data frame [2 x 6]
##
## category k PCA accuracy sd method
## (chr) (int) (int) (dbl) (dbl) (chr)
## 1 l 10 2 0.5021176 0.01427521 kmean
## 2 m 10 3 0.4596682 0.01710113 kmean
The table and graph above show the optimal k and PCA to use for knn. Around 50% was the optimal accuracy achieved.
For men, if we used the first 2 PCA or first 4 PCA, the optimal cluster is 5 or 6. For ladies, if we used the first 2 PCA or first 4 PCA, the optimal cluster is 6. Since we divided SPB 2015 into quartiles, we used 4 as our cluster number, which is also close to the optimal k-mean cluster.
Overall, kmean has higher accurarcy for ladies than for men. Increasing k helps to improve accuracy. But this is expected because we created more clusters and each cluster had a label of one of the quartiles. Increasing PCA does not always improve accuracy.
library(class)
getAccuracyForKNN <-function(X, k){
inTrain <- createDataPartition(y = X$SBS_2015_group, p=0.8)$Resample1
train <- slice(X, inTrain)
test <- slice(X, -inTrain)
train.class <- as.factor(train$SBS_2015_group)
test.class <- as.factor(test$SBS_2015_group)
train <- train %>% select(-SBS_2015_group)
test <- test %>% select(-SBS_2015_group)
#train <- train[, -ncol(train)]
#test <- test[, -ncol(test)]
test<-test%>%filter(complete.cases(.))
y_hat_knn <- knn(train = train,test = test, cl = train.class, k = k, prob=FALSE)
acc <- mean(test.class==y_hat_knn)
return(acc)
}
accuracy.knn.df <-
data.frame(
category = character(),
parameter = integer(),
accuracy = double(),
sd = double(),
method = character()
)
for ( i in 1:2){
if (i == 1) {
SBS_2015_group<- scorespread.m.df$SBS_2015_group
X <- data.m.df
title.add <- "for men"
category <- "m"
}else {
SBS_2015_group<- scorespread.l.df$SBS_2015_group
X <- data.l.df
title.add <- "for ladies"
category <- "l"
}
X <- X %>%
mutate(category = as.numeric.factor(category)) %>%
mutate(countryQuintile_by_skaterNum = as.numeric.factor(countryQuintile_by_skaterNum))
X <- bind_cols(X,data.frame(SBS_2015_group))
for (k in 1:10){
accuracy.knn<-replicate(100,getAccuracyForKNN(X, k))
# cat("For neighbor =", k, ", knn acheives accuracy with mean = ", mean(accuracy.knn), ", sd =",sd(accuracy.knn), " ",title.add, "\n")
new.df <- data.frame(category = category, parameter =k ,accuracy= mean(accuracy.knn),sd= sd(accuracy.knn),method = "knn")
accuracy.knn.df <- bind_rows(accuracy.knn.df, new.df)
}
}
accuracy.knn.df%>%
ggplot(aes(parameter,accuracy))+
geom_point(aes(color=category)) +
xlab("number of neighbor") +
ggtitle("KNN accuray")
do.call(rbind,lapply(split(accuracy.knn.df,accuracy.knn.df$category),function(chunk) chunk[which.max(chunk$accuracy),]))
## Source: local data frame [2 x 5]
##
## category parameter accuracy sd method
## (chr) (int) (dbl) (dbl) (chr)
## 1 l 6 0.5289583 0.06357072 knn
## 2 m 4 0.4667500 0.06779863 knn
The table and graph above show the optimal number of neighbor to use for knn. Knn has a higher accuracy to preduict SBS quartile for ladies then for men.
library(MASS)
getAccuracyForLDA <-function(X, k){
inTrain <- createDataPartition(y = X$SBS_2015_group, p=0.8)$Resample1
train <- slice(X, inTrain)
test <- slice(X, -inTrain)
train.class <- as.factor(train$SBS_2015_group)
test.class <- as.factor(test$SBS_2015_group)
train <- train %>% dplyr::select(-SBS_2015_group, -category)
test <- test %>% dplyr::select(-SBS_2015_group, -category)
#train <- train[, -ncol(train)]
#test <- test[, -ncol(test)]
test<-test%>%filter(complete.cases(.))
data.lda <- lda(train.class~.,
data=train)
#Project data on linear discriminants
data.lda.y_hat <- predict(data.lda, test)
#Extract scaling for each predictor
#data2.lda <- data.frame(varnames=rownames(coef(data.lda)), coef(data.lda))
acc <- mean(test.class==data.lda.y_hat$class)
return(acc)
}
accuracy.lda.df <-
data.frame(
category = character(),
#parameter = integer(),
accuracy = double(),
sd = double(),
method = character()
)
for ( i in 1:2){
if (i == 1) {
SBS_2015_group<- scorespread.m.df$SBS_2015_group
X <- data.m.df
title.add <- "for men"
category <- "m"
}else {
SBS_2015_group<- scorespread.l.df$SBS_2015_group
X <- data.l.df
title.add <- "for ladies"
category <- "l"
}
X <- X %>%
mutate(category = as.numeric.factor(category)) %>%
mutate(countryQuintile_by_skaterNum = as.numeric.factor(countryQuintile_by_skaterNum))
X <- bind_cols(X,data.frame(SBS_2015_group))
accuracy.lda<-replicate(100,getAccuracyForLDA(X, k))
new.df <- data.frame(category = category ,accuracy= mean(accuracy.lda),sd= sd(accuracy.lda),method = "lda")
accuracy.lda.df <- bind_rows(accuracy.lda.df, new.df)
}
do.call(rbind,lapply(split(accuracy.lda.df,accuracy.lda.df$category),function(chunk) chunk[which.max(chunk$accuracy),]))
## Source: local data frame [2 x 4]
##
## category accuracy sd method
## (chr) (dbl) (dbl) (chr)
## 1 l 0.5320833 0.06078742 lda
## 2 m 0.4597500 0.06541389 lda
The table and graph above show the optimal number of neighbors to use for LDA. LDA achieves the highest accuracy for ladies after 100 replications from using 20% of data as test set while KNN achieves the highest accuracy for men. Both have similar accuracy standard deviation of 6%.
Each method has higher accuracy to predict 2015 SBS quartile for ladies than for men.
The reason why the accuracy is not high can be that not all skater have every SBS score since 2008, ex. some skaters only started since 2012.
In the future, we can try other machine learning method, like SVM.